Build a basic campaign prioritization model using all relevant variables extracted from the database and identified in previous work.
library(tidyverse)
library(reshape2)
library(gridExtra)
library(splines)
library(lubridate)
library(wranglR)
library(Boruta)
library(foreach)
library(doParallel)
library(glmnet)
library(glmnetUtils)
# Functions adapted from previous analysis steps
source('code/functions.R')
# Visualization functions adapted from previous analysis steps
source('code/functions_viz.R')
# Set number of available CPU cores
registerDoParallel(detectCores() - 1)
The overarching goal is to predict giving over the final two years of the campaign. Ideally, I’d want to find expected future value, not just difference from expected value today. Consider the following:
\[ E \left( \text{giving, donor | covariates} \right) = E \left(\text{giving | donor, covariates} \right) P \left(\text{donor | covariates} \right) \]
Estimate the expected future value as the product of a conditional expected value and a probability. This can also be thought of as separate capacity and affinity models, and should give more useful estimates than \(E\left( \text{giving | covariates} \right)\), which is left-censored at $0.
It will be informative to see which features are more or less important at each stage of the two-step procedure, though I expect overall accuracy to suffer somewhat. Down the road it would be interesting to compare this approach to other methods, like trees and boosting.
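To make the two-step scoring concrete, here’s a minimal sketch of how the final estimate would be assembled, assuming prob_model is the fitted affinity (donor probability) model and amt_model is the fitted capacity model on the log10 dollar scale; both names are placeholders rather than objects defined below.
# Sketch of the two-stage score: P(donor | covariates) from a logistic model,
# times the back-transformed E(giving | donor, covariates) from a log10-scale model
score_expected_giving <- function(prob_model, amt_model, newdata) {
  p_donor <- predict(prob_model, newdata = newdata, type = 'response')
  log10_amt <- predict(amt_model, newdata = newdata)
  # Assumes the capacity model was fit on log10(giving + 1)
  p_donor * (10^log10_amt - 1)
}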
The target variable is the sum of new gifts and commitments from 9/1/16 to 8/31/18 (FY17-18), given the state of the database on 8/31/16 (FY16).
As a general principle, point-in-time data is derived from entered date ranges where possible. Where dates are missing, it is based on the date-added or date-modified audit trail for each field, as appropriate. The following data types received this treatment; the derivation is implemented in the accompanying SQL code.
# Parameters
train_fy <- 2016
filepath <- 'data/2018-11-30 point-in-time data.xlsx'
sheetname <- 'Select point_in_time_model'
# Import data
source('code/generate-pit-data.R')
# Run data generation function
modeling.data <- generate_pit_data(filepath, sheetname)
27249 failed to parse.
# Create response variables
modeling.data <- modeling.data %>% mutate(
rv.amt = NGC_TARGET_FY2 + NGC_TARGET_FY1
, rv.gave = rv.amt > 0
) %>% select(
# Drop future data
-NGC_TARGET_FY2
, -NGC_TARGET_FY1
, -CASH_TARGET_FY2
, -CASH_TARGET_FY1
, -PLEDGE_TARGET_FY2
, -PLEDGE_TARGET_FY1
, -AF_TARGET_FY2
, -AF_TARGET_FY1
, -CRU_TARGET_FY2
, -CRU_TARGET_FY1
) %>% filter(
# Drop entities whose RECORD_YR is after the training year
RECORD_YR <= train_fy
)
Logistic regression has been the workhorse of fundraising models for years. Some special considerations for this application:
For example, avoid predictors that leak the response, such as Lifetime.Giving when the response variable is Largest.Gift. I have previously found that penalized logistic regression, such as that implemented in R by the glmnet package, works better than standard logistic regression, so that’s the technique I’ll use here.
Here, the response variable is:
\[ Y_i = I \left( \text{FY18Giving}_i + \text{FY17Giving}_i > 0 \right) \]
I like computing random forest variable importance, e.g. Sauve & Tuleau-Malot (2014), to pre-screen variables. Define variable importance in a random forest as the increase in prediction error (here, MSE) when a given variable’s values are randomly permuted. One nice feature is that highly correlated variables should be similarly important.
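In symbols, using the usual permutation importance: for each tree \(t\) with out-of-bag error \(\text{err}_t\), average the increase in error after permuting variable \(X_j\) in the out-of-bag data,
\[ \text{VI} \left( X_j \right) = \frac{1}{n_{\text{tree}}} \sum_{t = 1}^{n_{\text{tree}}} \left( \text{err}_t^{(j)} - \text{err}_t \right) \]
where \(\text{err}_t^{(j)}\) is tree \(t\)’s error after \(X_j\) has been permuted.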
# Sample rows
prop = 1/5 # Proportion of data to sample
set.seed(287092)
samp <- sample_n(modeling.data, size = nrow(modeling.data) * prop)
# Run Boruta algorithm
rf.vars <- Boruta(
y = as.numeric(samp$rv.gave)
, x = samp %>% select(-rv.amt, -rv.gave)
, seed = 5993207
)
rf.vars %>% print()
Boruta performed 99 iterations in 1.03996 hours.
101 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 96
more;
43 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1,
COMMITTEE_KSM_ACTIVE, COMMITTEE_KSM_LDR, COMMITTEE_KSM_LDR_ACTIVE and 38 more;
7 tentative attributes left: ACTIVITIES_PFY2, ACTIVITIES_PFY3, ACTIVITIES_PFY4,
HAS_BUS_EMAIL, HOUSEHOLD_CITY and 2 more;
Save the results.
save(rf.vars, file = 'data/rf.vars.Rdata')
Plot the results.
(pmod_plot <- rf.vars %>% Borutadata() %>% Borutaplotter())
Basically, the algorithm creates dummy “shadow” variables, which are permuted versions of the explanatory variables appearing above, and random forests are fit on both the real and dummy variables. Intuitively, if replacing a variable with a randomly permuted version of itself does not reduce the random forest classifier’s accuracy, then the variable should not be included in a final model and can be discarded.
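As a minimal illustration of the permutation idea (not the Boruta internals themselves), the check below shuffles a single column and compares accuracy against the unshuffled baseline; fit is assumed to be any classifier whose predict() method returns class labels.
# Illustrative permutation check for one variable: a difference near zero after
# shuffling suggests the variable carries little signal and can be discarded
permutation_check <- function(fit, dat, response, var) {
  baseline <- mean(predict(fit, dat) == dat[[response]])
  shuffled <- dat
  shuffled[[var]] <- sample(shuffled[[var]])
  permuted <- mean(predict(fit, shuffled) == dat[[response]])
  baseline - permuted
}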
Recall that the response variable is making a new gift or commitment at any level within the next two years. From past experience, I know that most donations are outright gifts, under $1,000, and to an annual giving allocation. So the following is not too surprising:
I found these more surprising:
(recommended.vars <- TentativeRoughFix(rf.vars))
Boruta performed 99 iterations in 1.03996 hours.
Tentatives roughfixed over the last 99 iterations.
101 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 96
more;
50 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 45 more;
# Check variable correlations
recommended_vars <- recommended.vars$finalDecision[
which(recommended.vars$finalDecision == 'Confirmed')] %>% names()
numeric_vars <- modeling.data %>%
select(recommended_vars) %>%
select(-ID_NUMBER, -HOUSEHOLD_ID) %>%
select_if(is.numeric)
numeric_vars %>% plot_corrs(textsize = 2)
This is the correlation matrix for all 74 numeric variables confirmed important by the algorithm.
Begin by creating the modeling data file.
# Data file with variables removed
mdat <- modeling.data %>% select(rv.gave, recommended_vars) %>%
select(
-VELOCITY3_NGC, -VELOCITY_BINS_NGC, -VELOCITY_BINS_CASH, -VELOCITY3_LIN_NGC
, -GIVING_MAX_PLEDGE_YR, -GIVING_MAX_PLEDGE_FY, -CRU_STATUS
, -NGC_PFY1, -NGC_PFY2, -NGC_PFY3, -NGC_PFY4, -NGC_PFY5
, -AF_PFY1, -AF_PFY2, -AF_PFY3, -AF_PFY4, -AF_PFY5
, -GIVING_MAX_CASH_FY, -GIVING_NGC_TOTAL, -UPGRADE3_NGC, -LOYAL_5_PCT_ANY
, -DEGREES_CONCAT, -BIRTH_DT, -FIRST_KSM_YEAR
, -ID_NUMBER, -INSTITUTIONAL_SUFFIX # Keep HHID but don't use in modeling
, -KSM_GOS, -HOUSEHOLD_COUNTRY
, -KSM_EVENTS_ATTENDED, -EVENTS_ATTENDED
) %>% mutate(
# Create spouse flag
SPOUSE_ALUM = ifelse(SPOUSE_FIRST_KSM_YEAR > 0, 'TRUE', 'FALSE') %>% factor()
) %>% mutate_if(
# Numeric variables over 1E4 get a log10 transformation
function(x) {
ifelse(is.numeric(x), max(x) >= 1E4, FALSE)
}
, log10plus1
)
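(A note on the transformation: log10plus1() is sourced from code/functions.R; the definition assumed here is the usual one, which keeps zeros at zero.)
# Assumed definition of log10plus1() from code/functions.R, shown for reference
log10plus1 <- function(x) {
  log10(x + 1)
}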
# Cross-validation settings
folds = 10
reps = 5
# Withhold 10% of data as test set
xv <- KFoldXVal(mdat, k = 2, prop = .1, seed = 4960582)
holdoutdat <- mdat[xv[[1]], ]
traindat <- mdat[xv[[2]], ]
remove(xv)
I’ll use ridge-penalized logistic regression as implemented by glmnet. Advantages of shrinkage techniques include automatic control of overfitting and collinearity.
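For reference, ridge regression chooses coefficients by adding a squared-coefficient penalty to the negative log-likelihood, with the penalty weight \(\lambda\) selected by cross-validation:
\[ \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ -\ell \left( \boldsymbol{\beta} \right) + \lambda \sum_{j} \beta_j^2 \right\} \]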
# Store timings
glm_ridge_baseline_timestamps <- list()
# Store model errors
glm_nospline <- list()
# Seed for reproducibility
set.seed(2934223)
# Outer loop (repetitions)
for (rep in 1:reps) {
# Status report
timestamp <- paste('+ Iteration', rep, 'beginning at:', Sys.time())
print(timestamp)
glm_ridge_baseline_timestamps <- c(glm_ridge_baseline_timestamps, timestamp)
# Create cross-validation indices
xv <- KFoldXVal(traindat, k = folds)
# Inner loop (parallel cross-validation)
errs_out <- foreach(
fold = 1:length(xv)
, .combine = c
, .packages = c('glmnet', 'glmnetUtils', 'dplyr', 'splines')
) %dopar% {
# Fit temp model, where alpha = 0 is the ridge regression penalty
tmpmodel <- cv.glmnet(
rv.gave ~ .
# Train while withholding some data
, data = traindat[-xv[[fold]], ] %>% select(-HOUSEHOLD_ID)
, family = 'binomial'
, alpha = 0
, lambda = 2^(-8:5)
)
# Prediction threshold
theta1 <- sum(traindat$rv.gave[-xv[[fold]]] == 1) / nrow(traindat[-xv[[fold]], ])
# Confusion matrix based on the withheld data
tmpconfus <- conf_matrix_glmnet(tmpmodel, newdata = traindat[xv[[fold]], ], rv = 'rv.gave', threshold = theta1)
# Return results
return(
list(
conf_matrix = tmpconfus$conf_matrix
, conf_matrix_pct = tmpconfus$conf_matrix_pct
, errors = data.frame(
reps = rep
, folds = fold
, error = tmpconfus$error
, precision = tmpconfus$precision
, sensitivity = tmpconfus$sensitivity
, F1_score = tmpconfus$F1_score
)
)
)
}
# Write results to errors data frame
glm_nospline <- c(glm_nospline, errs_out)
# Status report
timestamp <- paste(' -Iteration', rep, 'ending at: ', Sys.time())
print(timestamp)
glm_ridge_baseline_timestamps <- c(glm_ridge_baseline_timestamps, timestamp)
}
[1] "+ Iteration 1 beginning at: 2018-12-21 11:47:32"
[1] " -Iteration 1 ending at: 2018-12-21 11:49:40"
[1] "+ Iteration 2 beginning at: 2018-12-21 11:49:40"
[1] " -Iteration 2 ending at: 2018-12-21 11:51:49"
[1] "+ Iteration 3 beginning at: 2018-12-21 11:51:49"
[1] " -Iteration 3 ending at: 2018-12-21 11:53:57"
[1] "+ Iteration 4 beginning at: 2018-12-21 11:53:57"
[1] " -Iteration 4 ending at: 2018-12-21 11:56:01"
[1] "+ Iteration 5 beginning at: 2018-12-21 11:56:01"
[1] " -Iteration 5 ending at: 2018-12-21 11:58:16"
glm_ridge_baseline_timestamps %>% unlist() %>% print()
[1] "+ Iteration 1 beginning at: 2018-12-21 11:47:32"
[2] " -Iteration 1 ending at: 2018-12-21 11:49:40"
[3] "+ Iteration 2 beginning at: 2018-12-21 11:49:40"
[4] " -Iteration 2 ending at: 2018-12-21 11:51:49"
[5] "+ Iteration 3 beginning at: 2018-12-21 11:51:49"
[6] " -Iteration 3 ending at: 2018-12-21 11:53:57"
[7] "+ Iteration 4 beginning at: 2018-12-21 11:53:57"
[8] " -Iteration 4 ending at: 2018-12-21 11:56:01"
[9] "+ Iteration 5 beginning at: 2018-12-21 11:56:01"
[10] " -Iteration 5 ending at: 2018-12-21 11:58:16"
# Function to reshape list data
combine_xval <- function(xval_results = list()) {
# Function to reformat list output into groups
delister <- function(full_list, first_idx = 1, step) {
output <- list()
idx <- seq(first_idx, length(full_list), by = step)
for (i in 1:length(idx)) {
output <- c(output, full_list[idx[i]])
}
return(output)
}
# Separate the output into groups of 3
conf_matrix = delister(xval_results, 1, 3)
conf_matrix_pct = delister(xval_results, 2, 3)
errors = delister(xval_results, 3, 3)
# Turn errors into a data frame
errors <- foreach(i = 1:length(errors), .combine = rbind) %do% {
return(errors[[i]])
} %>% data.frame()
# Return organized list
return(
list(
conf_matrix = conf_matrix
, conf_matrix_pct = conf_matrix_pct
, errors = errors
)
)
}
# Save results
glm_ridge_baseline_results <- combine_xval(glm_nospline)
glm_ridge_baseline_model <- cv.glmnet(
rv.gave ~ .
, data = traindat %>% select(-HOUSEHOLD_ID)
, family = 'binomial'
, alpha = 0
)
save(
glm_nospline
, glm_ridge_baseline_model
, glm_ridge_baseline_results
, glm_ridge_baseline_timestamps
, file = 'data/glm_ridge_baseline.Rdata'
)
grid.arrange(
histogrammer(glm_ridge_baseline_results$errors, 'error', h = .0005, fill = 'pink')
, histogrammer(glm_ridge_baseline_results$errors, 'precision', h = .005, fill = 'cyan')
, histogrammer(glm_ridge_baseline_results$errors, 'sensitivity', h = .005, fill = 'green')
)
Let TP, TN, FP, FN refer to true positives, true negatives, false positives, and false negatives respectively.
\[ \text{error} = \frac{FP + FN}{n}\] \[ \text{precision} = \frac{TP}{TP + FP}\] \[ \text{sensitivity} = \frac{TP}{TP + FN}\]
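For reference, the F1 score reported alongside these is, by the standard definition, the harmonic mean of precision and sensitivity:
\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} \]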
Compared to the AF $10K model, this has higher error due to the decreased sensitivity, but much higher precision.
The metrics to beat so far:
(
glm_baseline_err <- data.frame(
glm_ridge_baseline = glm_ridge_baseline_results$errors %>%
select(-reps, -folds) %>%
colMeans()
)
)
Consider a standard logistic regression model to get a better sense of the explanatory variables.
glm_standard <- glm(
rv.gave ~ .
, data = traindat %>% select(-HOUSEHOLD_ID) %>%
select(-RECORD_STATUS_CODE) # Results in separation if included
, family = 'binomial'
)
summary(glm_standard)
Call:
glm(formula = rv.gave ~ ., family = "binomial", data = traindat %>%
select(-HOUSEHOLD_ID) %>% select(-RECORD_STATUS_CODE))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9841 -0.2635 -0.1401 -0.0360 3.8324
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.499e+01 6.128e+00 -12.236 < 2e-16 ***
PROGRAM_GROUPNONE -1.188e+00 1.247e-01 -9.532 < 2e-16 ***
PROGRAM_GROUPEMP -1.106e-01 6.141e-02 -1.801 0.071769 .
PROGRAM_GROUPEXECED -3.005e+00 3.920e-01 -7.665 1.79e-14 ***
PROGRAM_GROUPNONGRD -1.433e+00 5.230e-01 -2.741 0.006128 **
PROGRAM_GROUPPHD 1.297e-01 2.074e-01 0.625 0.531723
PROGRAM_GROUPTMP -2.372e-01 5.288e-02 -4.486 7.26e-06 ***
SPOUSE_FIRST_KSM_YEAR 1.462e-02 7.415e-03 1.972 0.048631 *
PREF_ADDR_TYPE_CODEALT -4.584e-01 1.745e-01 -2.627 0.008620 **
PREF_ADDR_TYPE_CODEBUS 4.098e-03 7.477e-02 0.055 0.956296
HOUSEHOLD_CONTINENT -2.239e+00 1.508e-01 -14.846 < 2e-16 ***
HOUSEHOLD_CONTINENTAfrica -2.104e+00 1.052e+00 -1.999 0.045563 *
HOUSEHOLD_CONTINENTAsia -2.662e-01 9.017e-02 -2.952 0.003155 **
HOUSEHOLD_CONTINENTEurope -9.274e-01 1.249e-01 -7.428 1.10e-13 ***
HOUSEHOLD_CONTINENTOceania -1.211e+00 4.513e-01 -2.684 0.007279 **
HOUSEHOLD_CONTINENTSouth America -5.347e-01 1.764e-01 -3.032 0.002428 **
BUS_IS_EMPLOYEDTRUE 2.764e-01 5.995e-02 4.611 4.01e-06 ***
HAS_HOME_ADDRTRUE -1.563e-01 4.044e-02 -3.865 0.000111 ***
HAS_HOME_PHONETRUE -4.271e-01 4.205e-02 -10.157 < 2e-16 ***
HAS_HOME_EMAILTRUE 1.904e-01 5.017e-02 3.795 0.000148 ***
GIVING_FIRST_YEAR 4.110e-03 3.814e-03 1.078 0.281245
GIVING_FIRST_YEAR_CASH_AMT -7.195e-02 2.705e-02 -2.660 0.007818 **
GIVING_FIRST_YEAR_PLEDGE_AMT -1.497e-02 3.146e-02 -0.476 0.634161
GIVING_MAX_CASH_AMT -1.866e-01 9.258e-02 -2.016 0.043822 *
GIVING_MAX_PLEDGE_AMT -1.017e-02 4.939e-01 -0.021 0.983573
GIVING_CASH_TOTAL 6.074e-01 1.235e-01 4.918 8.74e-07 ***
GIVING_PLEDGE_TOTAL -1.158e-01 5.138e-01 -0.225 0.821659
GIVING_AF_TOTAL -1.139e-02 3.296e-02 -0.346 0.729709
GIVING_CRU_TOTAL -2.144e-01 8.502e-02 -2.522 0.011677 *
GIFTS_ALLOCS_SUPPORTED 7.766e-02 2.237e-02 3.471 0.000518 ***
GIFTS_FYS_SUPPORTED 6.355e-02 9.809e-03 6.479 9.24e-11 ***
GIFTS_CASH 2.578e-02 8.141e-03 3.167 0.001542 **
GIFTS_CREDIT_CARD1 -3.949e-02 6.314e-02 -0.625 0.531683
GIFTS_CREDIT_CARD2+ 4.717e-02 6.661e-02 0.708 0.478844
GIFTS_OUTRIGHTS_PAYMENTS -3.799e-02 8.402e-03 -4.522 6.13e-06 ***
GIFTS_PLEDGES -9.104e-02 7.531e-02 -1.209 0.226721
CASH_PFY1 3.288e-01 9.089e-02 3.618 0.000297 ***
CASH_PFY2 -1.120e-02 8.314e-02 -0.135 0.892800
CASH_PFY3 2.028e-01 6.616e-02 3.065 0.002174 **
CASH_PFY4 2.122e-01 4.674e-02 4.542 5.59e-06 ***
CASH_PFY5 -1.544e-01 7.514e-02 -2.055 0.039838 *
CRU_PFY1 -9.195e-02 8.519e-02 -1.079 0.280431
CRU_PFY2 -5.880e-02 9.031e-02 -0.651 0.514998
CRU_PFY3 -2.353e-01 6.619e-02 -3.555 0.000378 ***
CRU_PFY4 4.308e-02 4.686e-02 0.919 0.357879
CRU_PFY5 -9.747e-03 4.540e-02 -0.215 0.830006
CRU_GIVING_SEGMENTDonor -6.492e+00 7.658e+00 -0.848 0.396565
CRU_GIVING_SEGMENTLapsed -7.862e+00 7.640e+00 -1.029 0.303506
CRU_GIVING_SEGMENTLoyal 2 of 3 -5.791e+00 7.663e+00 -0.756 0.449808
CRU_GIVING_SEGMENTLoyal 3+ -4.654e+00 7.670e+00 -0.607 0.543964
CRU_GIVING_SEGMENTLYBUNT -6.570e+00 7.659e+00 -0.858 0.391034
CRU_GIVING_SEGMENTNon -8.363e+00 7.648e+00 -1.093 0.274178
CRU_GIVING_SEGMENTPYBUNT -7.322e+00 7.650e+00 -0.957 0.338532
GIFT_CLUB_KLC_YRS -1.127e-01 2.523e-02 -4.466 7.99e-06 ***
GIFT_CLUB_LOYAL_YRS -1.462e-02 3.249e-02 -0.450 0.652674
GIFT_CLUBS_CFY 1.202e-02 2.514e-02 0.478 0.632533
GIFT_CLUBS_PFY1 -2.204e-02 3.406e-02 -0.647 0.517448
GIFT_CLUBS_PFY2 -8.213e-03 3.088e-02 -0.266 0.790250
EVALUATION_LOWER_BOUND 9.227e-04 1.364e-02 0.068 0.946080
UOR_LOWER_BOUND 4.158e-03 1.932e-02 0.215 0.829626
KSM_GOS_FLAGTRUE 1.555e-01 1.538e-01 1.011 0.311966
MONTHS_ASSIGNED -3.909e-03 3.065e-03 -1.275 0.202133
COMMITTEE_NU_DISTINCT -1.482e-02 3.213e-02 -0.461 0.644582
COMMITTEE_NU_YEARS 2.340e-02 1.698e-02 1.378 0.168215
COMMITTEE_KSM_DISTINCT 6.122e-02 3.169e-02 1.932 0.053347 .
COMMITTEES_CFY 8.539e-02 3.920e-02 2.179 0.029364 *
COMMITTEES_PFY1 -1.054e-02 3.263e-02 -0.323 0.746621
COMMITTEES_PFY2 9.016e-03 3.673e-02 0.245 0.806095
COMMITTEES_PFY3 -2.175e-02 3.306e-02 -0.658 0.510602
EVENTS_YRS 2.154e-02 8.308e-03 2.593 0.009513 **
EVENTS_PREV_3_FY -9.849e-02 3.050e-02 -3.230 0.001240 **
KSM_EVENTS_YRS 1.581e-01 2.757e-02 5.735 9.74e-09 ***
KSM_EVENTS_PREV_3_FY -5.342e-02 1.856e-02 -2.878 0.003996 **
KSM_EVENTS_REUNIONS1 7.776e-02 5.434e-02 1.431 0.152439
KSM_EVENTS_REUNIONS2 -1.902e-01 8.835e-02 -2.153 0.031328 *
KSM_EVENTS_REUNIONS3+ -5.487e-01 1.779e-01 -3.084 0.002045 **
EVENTS_CFY 9.934e-02 3.232e-02 3.074 0.002114 **
EVENTS_PFY1 9.554e-02 3.346e-02 2.855 0.004300 **
ATHLETICS_TICKET_YEARS 9.034e-03 2.599e-02 0.348 0.728140
ATHLETICS_TICKET_LAST -4.174e-05 6.988e-05 -0.597 0.550273
RECORD_YR 1.211e-02 2.597e-03 4.663 3.12e-06 ***
GIVING_MAX_CASH_YR 2.361e-02 2.997e-03 7.877 3.35e-15 ***
GIVING_MAX_CASH_MO2 3.805e-02 1.144e-01 0.333 0.739334
GIVING_MAX_CASH_MO3 1.636e-02 1.054e-01 0.155 0.876686
GIVING_MAX_CASH_MO4 -4.970e-02 1.041e-01 -0.477 0.633063
GIVING_MAX_CASH_MO5 1.077e-01 9.933e-02 1.084 0.278198
GIVING_MAX_CASH_MO6 -8.919e-02 1.009e-01 -0.884 0.376823
GIVING_MAX_CASH_MO7 2.662e-02 1.108e-01 0.240 0.810133
GIVING_MAX_CASH_MO8 5.940e-02 9.899e-02 0.600 0.548468
GIVING_MAX_CASH_MO9 2.047e-03 1.259e-01 0.016 0.987028
GIVING_MAX_CASH_MO10 1.567e-01 1.167e-01 1.343 0.179388
GIVING_MAX_CASH_MO11 1.004e-01 1.086e-01 0.925 0.355095
GIVING_MAX_CASH_MO12 2.081e-01 9.268e-02 2.245 0.024752 *
GIVING_MAX_PLEDGE_MO2 -2.644e-01 9.747e-02 -2.712 0.006686 **
GIVING_MAX_PLEDGE_MO3 -1.080e-01 9.082e-02 -1.189 0.234571
GIVING_MAX_PLEDGE_MO4 -2.955e-01 9.925e-02 -2.978 0.002903 **
GIVING_MAX_PLEDGE_MO5 -1.962e-01 8.749e-02 -2.243 0.024915 *
GIVING_MAX_PLEDGE_MO6 -3.767e-01 9.013e-02 -4.179 2.93e-05 ***
GIVING_MAX_PLEDGE_MO7 -6.998e-02 1.112e-01 -0.629 0.529099
GIVING_MAX_PLEDGE_MO8 -1.837e-01 9.768e-02 -1.880 0.060051 .
GIVING_MAX_PLEDGE_MO9 1.065e-01 1.008e-01 1.056 0.290869
GIVING_MAX_PLEDGE_MO10 -2.566e-01 8.657e-02 -2.964 0.003032 **
GIVING_MAX_PLEDGE_MO11 -2.027e-01 9.036e-02 -2.243 0.024880 *
GIVING_MAX_PLEDGE_MO12 -2.048e-01 1.003e-01 -2.042 0.041189 *
KSM_PROSPECTNo -1.250e-01 9.011e-02 -1.387 0.165381
KSM_PROSPECTPast -1.586e-01 1.238e-01 -1.281 0.200051
VISITORS_5FY -1.481e-02 2.882e-02 -0.514 0.607336
LOYAL_5_PCT_CASH 3.946e+00 8.421e-01 4.686 2.79e-06 ***
UPGRADE3_CASH-1 2.256e-01 1.725e-01 1.308 0.190971
UPGRADE3_CASH0 7.132e-02 1.739e-01 0.410 0.681652
UPGRADE3_CASH1 4.819e-05 2.098e-01 0.000 0.999817
UPGRADE3_CASH2 1.936e-01 2.346e-01 0.825 0.409251
VELOCITY3_CASH -2.512e-01 1.061e-01 -2.368 0.017872 *
VELOCITY3_LIN_CASH 3.108e-02 2.848e-02 1.091 0.275133
SPOUSE_ALUMTRUE -2.889e+01 1.482e+01 -1.950 0.051213 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46782 on 76095 degrees of freedom
Residual deviance: 23309 on 75981 degrees of freedom
AIC: 23539
Number of Fisher Scoring iterations: 9
summary(glm_standard, corr = TRUE)$correlation %>%
data.frame() %>%
plot_corrs()
Pretty eye-watering. Look at the term plots.
termplot(glm_standard)
Definitely keep:
Definitely drop:
Needs transformation:
Duplicative:
glm_standard <- glm_standard %>% update(
data = traindat %>% select(
-HOUSEHOLD_ID
, -RECORD_STATUS_CODE
# Drop
, -HAS_HOME_EMAIL
, -GIFTS_CREDIT_CARD
, -contains('GIFT_CLUB')
, -KSM_GOS_FLAG
, -EVALUATION_LOWER_BOUND
, -UOR_LOWER_BOUND
, -KSM_PROSPECT
, -KSM_EVENTS_REUNIONS
, -GIVING_MAX_PLEDGE_MO
# Duplicative
, -GIVING_FIRST_YEAR_PLEDGE_AMT
, -GIVING_MAX_CASH_AMT
, -GIVING_AF_TOTAL
, -GIFTS_OUTRIGHTS_PAYMENTS
, -contains('CRU_PFY')
, -contains('COMMITTEES_')
, -contains('EVENTS_YRS')
, -VELOCITY3_CASH
, -SPOUSE_FIRST_KSM_YEAR
)
)
summary(glm_standard)
Call:
glm(formula = rv.gave ~ ., family = "binomial", data = traindat %>%
select(-HOUSEHOLD_ID, -RECORD_STATUS_CODE, -HAS_HOME_EMAIL,
-GIFTS_CREDIT_CARD, -contains("GIFT_CLUB"), -KSM_GOS_FLAG,
-EVALUATION_LOWER_BOUND, -UOR_LOWER_BOUND, -KSM_PROSPECT,
-KSM_EVENTS_REUNIONS, -GIVING_MAX_PLEDGE_MO, -GIVING_FIRST_YEAR_PLEDGE_AMT,
-GIVING_MAX_CASH_AMT, -GIVING_AF_TOTAL, -GIFTS_OUTRIGHTS_PAYMENTS,
-contains("CRU_PFY"), -contains("COMMITTEES_"), -contains("EVENTS_YRS"),
-VELOCITY3_CASH, -SPOUSE_FIRST_KSM_YEAR))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8951 -0.2673 -0.1436 -0.0347 3.8464
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.419e+01 5.939e+00 -14.174 < 2e-16 ***
PROGRAM_GROUPNONE -1.338e+00 1.182e-01 -11.319 < 2e-16 ***
PROGRAM_GROUPEMP -1.731e-01 5.886e-02 -2.941 0.003274 **
PROGRAM_GROUPEXECED -3.104e+00 3.889e-01 -7.981 1.45e-15 ***
PROGRAM_GROUPNONGRD -1.596e+00 5.221e-01 -3.057 0.002236 **
PROGRAM_GROUPPHD 3.468e-02 2.058e-01 0.169 0.866170
PROGRAM_GROUPTMP -2.917e-01 4.946e-02 -5.898 3.69e-09 ***
PREF_ADDR_TYPE_CODEALT -5.028e-01 1.743e-01 -2.885 0.003918 **
PREF_ADDR_TYPE_CODEBUS -1.564e-02 7.414e-02 -0.211 0.832919
HOUSEHOLD_CONTINENT -2.355e+00 1.490e-01 -15.802 < 2e-16 ***
HOUSEHOLD_CONTINENTAfrica -2.200e+00 1.059e+00 -2.079 0.037658 *
HOUSEHOLD_CONTINENTAsia -3.379e-01 8.889e-02 -3.801 0.000144 ***
HOUSEHOLD_CONTINENTEurope -9.721e-01 1.239e-01 -7.845 4.32e-15 ***
HOUSEHOLD_CONTINENTOceania -1.377e+00 4.488e-01 -3.067 0.002159 **
HOUSEHOLD_CONTINENTSouth America -5.796e-01 1.755e-01 -3.302 0.000959 ***
BUS_IS_EMPLOYEDTRUE 3.331e-01 5.900e-02 5.645 1.65e-08 ***
HAS_HOME_ADDRTRUE -1.572e-01 4.010e-02 -3.921 8.83e-05 ***
HAS_HOME_PHONETRUE -4.160e-01 4.170e-02 -9.975 < 2e-16 ***
GIVING_FIRST_YEAR 2.853e-03 3.664e-03 0.778 0.436284
GIVING_FIRST_YEAR_CASH_AMT -5.655e-02 2.418e-02 -2.339 0.019337 *
GIVING_MAX_PLEDGE_AMT -2.016e-01 4.864e-01 -0.414 0.678597
GIVING_CASH_TOTAL 5.296e-01 7.527e-02 7.036 1.98e-12 ***
GIVING_PLEDGE_TOTAL 6.126e-02 5.073e-01 0.121 0.903892
GIVING_CRU_TOTAL -3.640e-01 7.006e-02 -5.195 2.05e-07 ***
GIFTS_ALLOCS_SUPPORTED 8.033e-02 2.182e-02 3.681 0.000232 ***
GIFTS_FYS_SUPPORTED 5.184e-02 7.605e-03 6.816 9.34e-12 ***
GIFTS_CASH 9.161e-04 5.073e-03 0.181 0.856694
GIFTS_PLEDGES -1.028e-01 7.436e-02 -1.383 0.166771
CASH_PFY1 2.226e-01 4.944e-02 4.502 6.73e-06 ***
CASH_PFY2 -4.504e-02 3.477e-02 -1.296 0.195126
CASH_PFY3 -9.976e-03 3.557e-02 -0.280 0.779122
CASH_PFY4 2.323e-01 2.329e-02 9.974 < 2e-16 ***
CASH_PFY5 -3.178e-01 6.238e-02 -5.095 3.50e-07 ***
CRU_GIVING_SEGMENTDonor -3.948e+00 7.361e+00 -0.536 0.591754
CRU_GIVING_SEGMENTLapsed -5.277e+00 7.348e+00 -0.718 0.472691
CRU_GIVING_SEGMENTLoyal 2 of 3 -3.406e+00 7.362e+00 -0.463 0.643658
CRU_GIVING_SEGMENTLoyal 3+ -2.402e+00 7.365e+00 -0.326 0.744343
CRU_GIVING_SEGMENTLYBUNT -4.086e+00 7.361e+00 -0.555 0.578847
CRU_GIVING_SEGMENTNon -6.137e+00 7.348e+00 -0.835 0.403556
CRU_GIVING_SEGMENTPYBUNT -4.765e+00 7.356e+00 -0.648 0.517172
MONTHS_ASSIGNED -3.554e-03 2.816e-03 -1.262 0.206826
COMMITTEE_NU_DISTINCT 3.956e-03 2.844e-02 0.139 0.889351
COMMITTEE_NU_YEARS 2.901e-02 1.682e-02 1.725 0.084600 .
COMMITTEE_KSM_DISTINCT 6.049e-02 2.774e-02 2.181 0.029220 *
EVENTS_PREV_3_FY -3.338e-02 2.614e-02 -1.277 0.201530
KSM_EVENTS_PREV_3_FY -1.045e-02 1.600e-02 -0.653 0.513713
EVENTS_CFY 6.143e-02 3.173e-02 1.936 0.052843 .
EVENTS_PFY1 6.426e-02 3.255e-02 1.974 0.048328 *
ATHLETICS_TICKET_YEARS 1.607e-02 2.536e-02 0.634 0.526260
ATHLETICS_TICKET_LAST -6.974e-05 6.915e-05 -1.009 0.313192
RECORD_YR 1.326e-02 2.454e-03 5.406 6.44e-08 ***
GIVING_MAX_CASH_YR 2.701e-02 2.907e-03 9.293 < 2e-16 ***
GIVING_MAX_CASH_MO2 -1.571e-02 1.119e-01 -0.140 0.888337
GIVING_MAX_CASH_MO3 -2.271e-02 1.036e-01 -0.219 0.826475
GIVING_MAX_CASH_MO4 -7.325e-02 1.026e-01 -0.714 0.475058
GIVING_MAX_CASH_MO5 6.974e-02 9.782e-02 0.713 0.475907
GIVING_MAX_CASH_MO6 -1.621e-01 9.932e-02 -1.632 0.102736
GIVING_MAX_CASH_MO7 4.317e-03 1.093e-01 0.040 0.968491
GIVING_MAX_CASH_MO8 9.995e-03 9.708e-02 0.103 0.917992
GIVING_MAX_CASH_MO9 8.731e-02 1.219e-01 0.716 0.473926
GIVING_MAX_CASH_MO10 8.821e-02 1.141e-01 0.773 0.439596
GIVING_MAX_CASH_MO11 4.664e-02 1.063e-01 0.439 0.660817
GIVING_MAX_CASH_MO12 1.880e-01 9.136e-02 2.058 0.039569 *
VISITORS_5FY 1.076e-02 2.712e-02 0.397 0.691686
LOYAL_5_PCT_CASH 5.825e+00 7.635e-01 7.629 2.36e-14 ***
UPGRADE3_CASH-1 1.850e-01 1.691e-01 1.094 0.273969
UPGRADE3_CASH0 7.754e-02 1.723e-01 0.450 0.652585
UPGRADE3_CASH1 1.043e-02 2.069e-01 0.050 0.959775
UPGRADE3_CASH2 2.263e-02 2.233e-01 0.101 0.919281
VELOCITY3_LIN_CASH 2.131e-02 2.796e-02 0.762 0.445873
SPOUSE_ALUMTRUE 3.334e-01 8.541e-02 3.903 9.50e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46782 on 76095 degrees of freedom
Residual deviance: 23539 on 76025 degrees of freedom
AIC: 23681
Number of Fisher Scoring iterations: 9
The AIC is higher, but comparable.
dfs <- 4
Now introduce splines on the numeric variables, arbitrarily setting df = 4.
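For reference, ns() from the splines package expands a numeric vector into a natural cubic spline basis with df columns, so each ns(x, df = 4) term below contributes four coefficients to the model.
# Quick check: a natural cubic spline basis with df = 4 has four columns
dim(ns(1:100, df = dfs))
[1] 100   4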
glm_st_splines <- glm(
rv.gave ~
PROGRAM_GROUP +
PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
ns(YEARS_SINCE_FIRST_GIFT, df = dfs) +
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs) +
ns(GIVING_MAX_PLEDGE_AMT, df = dfs) +
ns(GIVING_CASH_TOTAL, df = dfs) +
ns(GIVING_PLEDGE_TOTAL, df = dfs) +
ns(GIVING_CRU_TOTAL, df = dfs) +
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs) +
ns(GIFTS_FYS_SUPPORTED, df = dfs) +
ns(GIFTS_CASH, df = dfs) +
ns(GIFTS_PLEDGES, df = dfs) +
ns(CASH_PFY1, df = dfs) +
ns(CASH_PFY2, df = dfs) +
ns(CASH_PFY3, df = dfs) +
ns(CASH_PFY4, df = dfs) +
ns(CASH_PFY5, df = dfs) +
CRU_GIVING_SEGMENT +
ns(EVALUATION_LOWER_BOUND, df = dfs) +
ns(UOR_LOWER_BOUND, df = dfs) +
ns(MONTHS_ASSIGNED, df = dfs) +
ns(COMMITTEE_NU_DISTINCT, df = dfs) +
ns(COMMITTEE_NU_YEARS, df = dfs) +
ns(COMMITTEE_KSM_DISTINCT, df = dfs) +
ns(EVENTS_PREV_3_FY, df = dfs) +
ns(EVENTS_CFY, df = dfs) +
ns(EVENTS_PFY1, df = dfs) +
ns(ATHLETICS_TICKET_YEARS, df = dfs) +
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs) +
ns(RECORD_YR, df = dfs) +
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs) +
GIVING_MAX_CASH_MO +
KSM_PROSPECT +
ns(VISITORS_5FY, df = dfs) +
LOYAL_5_PCT_CASH +
UPGRADE3_CASH +
VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
)
summary(glm_st_splines)
Call:
glm(formula = rv.gave ~ PROGRAM_GROUP + PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT + BUS_IS_EMPLOYED + HAS_HOME_ADDR + HAS_HOME_PHONE +
ns(YEARS_SINCE_FIRST_GIFT, df = dfs) + ns(GIVING_FIRST_YEAR_CASH_AMT,
df = dfs) + ns(GIVING_MAX_PLEDGE_AMT, df = dfs) + ns(GIVING_CASH_TOTAL,
df = dfs) + ns(GIVING_PLEDGE_TOTAL, df = dfs) + ns(GIVING_CRU_TOTAL,
df = dfs) + ns(GIFTS_ALLOCS_SUPPORTED, df = dfs) + ns(GIFTS_FYS_SUPPORTED,
df = dfs) + ns(GIFTS_CASH, df = dfs) + ns(GIFTS_PLEDGES,
df = dfs) + ns(CASH_PFY1, df = dfs) + ns(CASH_PFY2, df = dfs) +
ns(CASH_PFY3, df = dfs) + ns(CASH_PFY4, df = dfs) + ns(CASH_PFY5,
df = dfs) + CRU_GIVING_SEGMENT + ns(EVALUATION_LOWER_BOUND,
df = dfs) + ns(UOR_LOWER_BOUND, df = dfs) + ns(MONTHS_ASSIGNED,
df = dfs) + ns(COMMITTEE_NU_DISTINCT, df = dfs) + ns(COMMITTEE_NU_YEARS,
df = dfs) + ns(COMMITTEE_KSM_DISTINCT, df = dfs) + ns(EVENTS_PREV_3_FY,
df = dfs) + ns(EVENTS_CFY, df = dfs) + ns(EVENTS_PFY1, df = dfs) +
ns(ATHLETICS_TICKET_YEARS, df = dfs) + ns(YEARS_SINCE_ATHLETICS_TICKETS,
df = dfs) + ns(RECORD_YR, df = dfs) + ns(YEARS_SINCE_MAX_CASH_YR,
df = dfs) + GIVING_MAX_CASH_MO + KSM_PROSPECT + ns(VISITORS_5FY,
df = dfs) + LOYAL_5_PCT_CASH + UPGRADE3_CASH + VELOCITY3_LIN_CASH +
SPOUSE_ALUM, family = "binomial", data = traindat %>% mutate(YEARS_SINCE_FIRST_GIFT = 2016 -
ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017), YEARS_SINCE_ATHLETICS_TICKETS = 2016 -
ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST,
2017), YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR >
0, GIVING_MAX_CASH_YR, 2017)))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9223 -0.2695 -0.1453 -0.0317 3.9149
Coefficients: (71 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.691519 3.223131 0.525 0.599718
PROGRAM_GROUPNONE -1.307533 0.121008 -10.805 < 2e-16 ***
PROGRAM_GROUPEMP -0.161109 0.059800 -2.694 0.007057 **
PROGRAM_GROUPEXECED -3.127629 0.389833 -8.023 1.03e-15 ***
PROGRAM_GROUPNONGRD -1.719495 0.563736 -3.050 0.002287 **
PROGRAM_GROUPPHD 0.021555 0.206089 0.105 0.916700
PROGRAM_GROUPTMP -0.298704 0.050454 -5.920 3.21e-09 ***
PREF_ADDR_TYPE_CODEALT -0.456107 0.175104 -2.605 0.009193 **
PREF_ADDR_TYPE_CODEBUS 0.003857 0.073975 0.052 0.958424
HOUSEHOLD_CONTINENT -2.217280 0.154429 -14.358 < 2e-16 ***
HOUSEHOLD_CONTINENTAfrica -2.204274 1.064301 -2.071 0.038349 *
HOUSEHOLD_CONTINENTAsia -0.298478 0.088878 -3.358 0.000784 ***
HOUSEHOLD_CONTINENTEurope -0.932184 0.124519 -7.486 7.09e-14 ***
HOUSEHOLD_CONTINENTOceania -1.424400 0.455825 -3.125 0.001779 **
HOUSEHOLD_CONTINENTSouth America -0.531665 0.175506 -3.029 0.002451 **
BUS_IS_EMPLOYEDTRUE 0.296278 0.060620 4.887 1.02e-06 ***
HAS_HOME_ADDRTRUE -0.165690 0.040389 -4.102 4.09e-05 ***
HAS_HOME_PHONETRUE -0.418747 0.041920 -9.989 < 2e-16 ***
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)1 1.426763 0.393234 3.628 0.000285 ***
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)2 1.580360 0.268025 5.896 3.72e-09 ***
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)3 NA NA NA NA
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)4 NA NA NA NA
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)1 -0.899549 0.828477 -1.086 0.277573
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)2 -0.976102 0.637153 -1.532 0.125529
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)3 NA NA NA NA
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)4 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)1 3.743279 3.808069 0.983 0.325614
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)2 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)3 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)4 NA NA NA NA
ns(GIVING_CASH_TOTAL, df = dfs)1 -3.816384 1.166911 -3.271 0.001074 **
ns(GIVING_CASH_TOTAL, df = dfs)2 0.716513 0.810228 0.884 0.376516
ns(GIVING_CASH_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_CASH_TOTAL, df = dfs)4 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)1 -2.776750 4.026394 -0.690 0.490423
ns(GIVING_PLEDGE_TOTAL, df = dfs)2 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)4 NA NA NA NA
ns(GIVING_CRU_TOTAL, df = dfs)1 5.198087 0.904969 5.744 9.25e-09 ***
ns(GIVING_CRU_TOTAL, df = dfs)2 2.821611 0.790900 3.568 0.000360 ***
ns(GIVING_CRU_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_CRU_TOTAL, df = dfs)4 NA NA NA NA
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)1 -1.338855 1.105541 -1.211 0.225880
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)2 0.202803 1.094022 0.185 0.852936
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)3 NA NA NA NA
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)4 NA NA NA NA
ns(GIFTS_FYS_SUPPORTED, df = dfs)1 -4.272014 0.561592 -7.607 2.81e-14 ***
ns(GIFTS_FYS_SUPPORTED, df = dfs)2 -1.017990 0.358445 -2.840 0.004511 **
ns(GIFTS_FYS_SUPPORTED, df = dfs)3 NA NA NA NA
ns(GIFTS_FYS_SUPPORTED, df = dfs)4 NA NA NA NA
ns(GIFTS_CASH, df = dfs)1 -5.097566 4.160092 -1.225 0.220444
ns(GIFTS_CASH, df = dfs)2 -3.498484 3.546673 -0.986 0.323931
ns(GIFTS_CASH, df = dfs)3 NA NA NA NA
ns(GIFTS_CASH, df = dfs)4 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)1 2.057330 0.829133 2.481 0.013090 *
ns(GIFTS_PLEDGES, df = dfs)2 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)3 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)4 NA NA NA NA
ns(CASH_PFY1, df = dfs)1 -2.103193 0.356799 -5.895 3.76e-09 ***
ns(CASH_PFY1, df = dfs)2 NA NA NA NA
ns(CASH_PFY1, df = dfs)3 NA NA NA NA
ns(CASH_PFY1, df = dfs)4 NA NA NA NA
ns(CASH_PFY2, df = dfs)1 -0.065122 0.277359 -0.235 0.814370
ns(CASH_PFY2, df = dfs)2 NA NA NA NA
ns(CASH_PFY2, df = dfs)3 NA NA NA NA
ns(CASH_PFY2, df = dfs)4 NA NA NA NA
ns(CASH_PFY3, df = dfs)1 -0.311056 0.265752 -1.170 0.241810
ns(CASH_PFY3, df = dfs)2 NA NA NA NA
ns(CASH_PFY3, df = dfs)3 NA NA NA NA
ns(CASH_PFY3, df = dfs)4 NA NA NA NA
ns(CASH_PFY4, df = dfs)1 -1.652616 0.172823 -9.562 < 2e-16 ***
ns(CASH_PFY4, df = dfs)2 NA NA NA NA
ns(CASH_PFY4, df = dfs)3 NA NA NA NA
ns(CASH_PFY4, df = dfs)4 NA NA NA NA
ns(CASH_PFY5, df = dfs)1 1.257206 0.534248 2.353 0.018611 *
ns(CASH_PFY5, df = dfs)2 NA NA NA NA
ns(CASH_PFY5, df = dfs)3 NA NA NA NA
ns(CASH_PFY5, df = dfs)4 NA NA NA NA
CRU_GIVING_SEGMENTDonor 0.798294 0.235017 3.397 0.000682 ***
CRU_GIVING_SEGMENTLapsed -0.551629 0.207531 -2.658 0.007859 **
CRU_GIVING_SEGMENTLoyal 2 of 3 1.189120 0.225871 5.265 1.40e-07 ***
CRU_GIVING_SEGMENTLoyal 3+ 2.048346 0.251541 8.143 3.85e-16 ***
CRU_GIVING_SEGMENTLYBUNT 0.630074 0.223302 2.822 0.004778 **
CRU_GIVING_SEGMENTNon -0.706040 0.362783 -1.946 0.051633 .
CRU_GIVING_SEGMENTPYBUNT 0.038088 0.207834 0.183 0.854593
ns(EVALUATION_LOWER_BOUND, df = dfs)1 0.000696 0.121294 0.006 0.995422
ns(EVALUATION_LOWER_BOUND, df = dfs)2 NA NA NA NA
ns(EVALUATION_LOWER_BOUND, df = dfs)3 NA NA NA NA
ns(EVALUATION_LOWER_BOUND, df = dfs)4 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)1 -0.141311 0.169779 -0.832 0.405225
ns(UOR_LOWER_BOUND, df = dfs)2 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)3 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)4 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)1 0.304049 0.570940 0.533 0.594352
ns(MONTHS_ASSIGNED, df = dfs)2 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)3 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)4 NA NA NA NA
ns(COMMITTEE_NU_DISTINCT, df = dfs)1 -0.653461 2.072880 -0.315 0.752577
ns(COMMITTEE_NU_DISTINCT, df = dfs)2 0.403521 1.913680 0.211 0.832995
ns(COMMITTEE_NU_DISTINCT, df = dfs)3 NA NA NA NA
ns(COMMITTEE_NU_DISTINCT, df = dfs)4 NA NA NA NA
ns(COMMITTEE_NU_YEARS, df = dfs)1 0.221676 0.988885 0.224 0.822627
ns(COMMITTEE_NU_YEARS, df = dfs)2 0.863657 1.029639 0.839 0.401584
ns(COMMITTEE_NU_YEARS, df = dfs)3 NA NA NA NA
ns(COMMITTEE_NU_YEARS, df = dfs)4 NA NA NA NA
ns(COMMITTEE_KSM_DISTINCT, df = dfs)1 -1.970115 1.212583 -1.625 0.104221
ns(COMMITTEE_KSM_DISTINCT, df = dfs)2 -1.599907 1.038235 -1.541 0.123320
ns(COMMITTEE_KSM_DISTINCT, df = dfs)3 NA NA NA NA
ns(COMMITTEE_KSM_DISTINCT, df = dfs)4 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)1 3.633802 3.436981 1.057 0.290390
ns(EVENTS_PREV_3_FY, df = dfs)2 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)3 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)4 NA NA NA NA
ns(EVENTS_CFY, df = dfs)1 -2.280573 1.479784 -1.541 0.123280
ns(EVENTS_CFY, df = dfs)2 NA NA NA NA
ns(EVENTS_CFY, df = dfs)3 NA NA NA NA
ns(EVENTS_CFY, df = dfs)4 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)1 -3.165687 1.775252 -1.783 0.074549 .
ns(EVENTS_PFY1, df = dfs)2 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)3 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)4 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)1 -0.027763 0.214692 -0.129 0.897108
ns(ATHLETICS_TICKET_YEARS, df = dfs)2 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)3 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)4 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)1 0.095205 0.310646 0.306 0.759244
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)2 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)3 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)4 NA NA NA NA
ns(RECORD_YR, df = dfs)1 2.288420 0.847561 2.700 0.006934 **
ns(RECORD_YR, df = dfs)2 1.731401 0.605119 2.861 0.004220 **
ns(RECORD_YR, df = dfs)3 5.267373 1.838442 2.865 0.004168 **
ns(RECORD_YR, df = dfs)4 1.452452 0.302687 4.799 1.60e-06 ***
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)1 -0.433171 0.118762 -3.647 0.000265 ***
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)2 -0.778818 0.410460 -1.897 0.057771 .
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)3 -4.254386 1.052326 -4.043 5.28e-05 ***
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)4 -7.094770 2.126671 -3.336 0.000850 ***
GIVING_MAX_CASH_MO2 -0.021552 0.112346 -0.192 0.847870
GIVING_MAX_CASH_MO3 -0.018255 0.103821 -0.176 0.860429
GIVING_MAX_CASH_MO4 -0.054266 0.103015 -0.527 0.598348
GIVING_MAX_CASH_MO5 0.111024 0.098181 1.131 0.258134
GIVING_MAX_CASH_MO6 -0.092274 0.099961 -0.923 0.355956
GIVING_MAX_CASH_MO7 0.025998 0.109582 0.237 0.812464
GIVING_MAX_CASH_MO8 0.052642 0.097686 0.539 0.589964
GIVING_MAX_CASH_MO9 0.114851 0.122398 0.938 0.348066
GIVING_MAX_CASH_MO10 0.109361 0.114806 0.953 0.340807
GIVING_MAX_CASH_MO11 0.027763 0.106798 0.260 0.794895
GIVING_MAX_CASH_MO12 0.186771 0.091782 2.035 0.041857 *
KSM_PROSPECTNo -0.149850 0.089256 -1.679 0.093176 .
KSM_PROSPECTPast -0.096806 0.123150 -0.786 0.431819
ns(VISITORS_5FY, df = dfs)1 -0.405392 0.752794 -0.539 0.590220
ns(VISITORS_5FY, df = dfs)2 NA NA NA NA
ns(VISITORS_5FY, df = dfs)3 NA NA NA NA
ns(VISITORS_5FY, df = dfs)4 NA NA NA NA
LOYAL_5_PCT_CASH 3.772261 0.818488 4.609 4.05e-06 ***
UPGRADE3_CASH-1 0.134585 0.169585 0.794 0.427421
UPGRADE3_CASH0 0.113803 0.172890 0.658 0.510386
UPGRADE3_CASH1 0.042539 0.208485 0.204 0.838323
UPGRADE3_CASH2 0.079741 0.225502 0.354 0.723627
VELOCITY3_LIN_CASH 0.009478 0.027977 0.339 0.734766
SPOUSE_ALUMTRUE 0.322650 0.085743 3.763 0.000168 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46782 on 76095 degrees of freedom
Residual deviance: 23419 on 76006 degrees of freedom
AIC: 23599
Number of Fisher Scoring iterations: 10
termplot(glm_st_splines)
Based on the term plots, refit with a trimmed variable set: drop the terms that added little, use square-root transformations for the count variables, and keep a spline only on RECORD_YR.
glm_st_splines <- glm(
rv.gave ~
PROGRAM_GROUP +
PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
# ns(YEARS_SINCE_FIRST_GIFT, 2) +
# GIVING_FIRST_YEAR_CASH_AMT +
# GIVING_MAX_PLEDGE_AMT +
GIVING_CASH_TOTAL +
# GIVING_PLEDGE_TOTAL +
# GIVING_CRU_TOTAL +
# sqrt(GIFTS_ALLOCS_SUPPORTED) +
sqrt(GIFTS_FYS_SUPPORTED) +
# sqrt(GIFTS_CASH) +
sqrt(GIFTS_PLEDGES) +
CASH_PFY1 +
CASH_PFY2 +
CASH_PFY3 +
CASH_PFY4 +
CASH_PFY5 +
CRU_GIVING_SEGMENT +
# EVALUATION_LOWER_BOUND +
# UOR_LOWER_BOUND +
# sqrt(MONTHS_ASSIGNED) +
# sqrt(COMMITTEE_NU_DISTINCT) +
# sqrt(COMMITTEE_NU_YEARS) +
# sqrt(COMMITTEE_KSM_DISTINCT) +
# sqrt(EVENTS_PREV_3_FY) +
sqrt(EVENTS_CFY) +
# sqrt(EVENTS_PFY1) +
# sqrt(ATHLETICS_TICKET_YEARS) +
# YEARS_SINCE_ATHLETICS_TICKETS +
ns(RECORD_YR, df = 5) +
YEARS_SINCE_MAX_CASH_YR +
GIVING_MAX_CASH_MO +
# KSM_PROSPECT +
# sqrt(VISITORS_5FY) +
LOYAL_5_PCT_CASH +
# UPGRADE3_CASH +
# VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
)
summary(glm_st_splines)
Call:
glm(formula = rv.gave ~ PROGRAM_GROUP + PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT + BUS_IS_EMPLOYED + HAS_HOME_ADDR + HAS_HOME_PHONE +
GIVING_CASH_TOTAL + sqrt(GIFTS_FYS_SUPPORTED) + sqrt(GIFTS_PLEDGES) +
CASH_PFY1 + CASH_PFY2 + CASH_PFY3 + CASH_PFY4 + CASH_PFY5 +
CRU_GIVING_SEGMENT + sqrt(EVENTS_CFY) + ns(RECORD_YR, df = 5) +
YEARS_SINCE_MAX_CASH_YR + GIVING_MAX_CASH_MO + LOYAL_5_PCT_CASH +
SPOUSE_ALUM, family = "binomial", data = traindat %>% mutate(YEARS_SINCE_FIRST_GIFT = 2016 -
ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017), YEARS_SINCE_MAX_CASH_YR = 2016 -
ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7695 -0.2668 -0.1427 -0.0356 3.7933
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.884909 1.256757 -6.274 3.52e-10 ***
PROGRAM_GROUPNONE -1.348272 0.113552 -11.874 < 2e-16 ***
PROGRAM_GROUPEMP -0.218890 0.057401 -3.813 0.000137 ***
PROGRAM_GROUPEXECED -3.188161 0.390359 -8.167 3.15e-16 ***
PROGRAM_GROUPNONGRD -1.645957 0.505930 -3.253 0.001141 **
PROGRAM_GROUPPHD -0.002408 0.205027 -0.012 0.990628
PROGRAM_GROUPTMP -0.328288 0.048848 -6.721 1.81e-11 ***
PREF_ADDR_TYPE_CODEALT -0.474587 0.173100 -2.742 0.006112 **
PREF_ADDR_TYPE_CODEBUS -0.012150 0.073983 -0.164 0.869555
HOUSEHOLD_CONTINENT -2.224402 0.152525 -14.584 < 2e-16 ***
HOUSEHOLD_CONTINENTAfrica -2.125180 1.058783 -2.007 0.044729 *
HOUSEHOLD_CONTINENTAsia -0.278599 0.088339 -3.154 0.001612 **
HOUSEHOLD_CONTINENTEurope -0.930010 0.123369 -7.538 4.76e-14 ***
HOUSEHOLD_CONTINENTOceania -1.299472 0.451479 -2.878 0.003999 **
HOUSEHOLD_CONTINENTSouth America -0.511160 0.174757 -2.925 0.003445 **
BUS_IS_EMPLOYEDTRUE 0.362305 0.060352 6.003 1.94e-09 ***
HAS_HOME_ADDRTRUE -0.167598 0.040203 -4.169 3.06e-05 ***
HAS_HOME_PHONETRUE -0.409968 0.041748 -9.820 < 2e-16 ***
GIVING_CASH_TOTAL 0.083497 0.041599 2.007 0.044726 *
sqrt(GIFTS_FYS_SUPPORTED) 0.441189 0.031399 14.051 < 2e-16 ***
sqrt(GIFTS_PLEDGES) -0.349652 0.038912 -8.986 < 2e-16 ***
CASH_PFY1 0.242021 0.030553 7.921 2.35e-15 ***
CASH_PFY2 -0.044281 0.031712 -1.396 0.162619
CASH_PFY3 0.021450 0.023286 0.921 0.356963
CASH_PFY4 0.225103 0.020043 11.231 < 2e-16 ***
CASH_PFY5 -0.311989 0.062847 -4.964 6.90e-07 ***
CRU_GIVING_SEGMENTDonor 1.360228 0.126125 10.785 < 2e-16 ***
CRU_GIVING_SEGMENTLapsed 0.112962 0.111556 1.013 0.311249
CRU_GIVING_SEGMENTLoyal 2 of 3 1.835777 0.125408 14.638 < 2e-16 ***
CRU_GIVING_SEGMENTLoyal 3+ 2.777647 0.153166 18.135 < 2e-16 ***
CRU_GIVING_SEGMENTLYBUNT 1.264283 0.135446 9.334 < 2e-16 ***
CRU_GIVING_SEGMENTNon 0.140704 0.123937 1.135 0.256256
CRU_GIVING_SEGMENTPYBUNT 0.634091 0.114802 5.523 3.33e-08 ***
sqrt(EVENTS_CFY) 0.147177 0.026535 5.547 2.91e-08 ***
ns(RECORD_YR, df = 5)1 4.063062 1.200700 3.384 0.000715 ***
ns(RECORD_YR, df = 5)2 5.005408 1.272925 3.932 8.42e-05 ***
ns(RECORD_YR, df = 5)3 3.368826 0.838618 4.017 5.89e-05 ***
ns(RECORD_YR, df = 5)4 8.969274 2.524985 3.552 0.000382 ***
ns(RECORD_YR, df = 5)5 2.255912 0.415268 5.432 5.56e-08 ***
YEARS_SINCE_MAX_CASH_YR -0.024679 0.002933 -8.414 < 2e-16 ***
GIVING_MAX_CASH_MO2 -0.007682 0.111631 -0.069 0.945137
GIVING_MAX_CASH_MO3 -0.023679 0.103255 -0.229 0.818617
GIVING_MAX_CASH_MO4 -0.062521 0.102225 -0.612 0.540800
GIVING_MAX_CASH_MO5 0.093697 0.097415 0.962 0.336134
GIVING_MAX_CASH_MO6 -0.155727 0.099125 -1.571 0.116179
GIVING_MAX_CASH_MO7 0.015812 0.108995 0.145 0.884651
GIVING_MAX_CASH_MO8 0.017199 0.096831 0.178 0.859022
GIVING_MAX_CASH_MO9 0.103307 0.121796 0.848 0.396327
GIVING_MAX_CASH_MO10 0.090505 0.113706 0.796 0.426056
GIVING_MAX_CASH_MO11 0.040718 0.105975 0.384 0.700812
GIVING_MAX_CASH_MO12 0.188005 0.091057 2.065 0.038952 *
LOYAL_5_PCT_CASH 5.758739 0.771286 7.466 8.24e-14 ***
SPOUSE_ALUMTRUE 0.401725 0.083494 4.811 1.50e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46782 on 76095 degrees of freedom
Residual deviance: 23592 on 76043 degrees of freedom
AIC: 23698
Number of Fisher Scoring iterations: 9
Again, similar AIC.
Fit a logistic regression model with the ridge penalty, using the same subset of variables chosen in the previous step.
glm_ridge_cv <- cv.glmnet(
rv.gave ~
PROGRAM_GROUP +
PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
# ns(YEARS_SINCE_FIRST_GIFT, 2) +
# GIVING_FIRST_YEAR_CASH_AMT +
# GIVING_MAX_PLEDGE_AMT +
GIVING_CASH_TOTAL +
# GIVING_PLEDGE_TOTAL +
# GIVING_CRU_TOTAL +
# sqrt(GIFTS_ALLOCS_SUPPORTED) +
sqrt(GIFTS_FYS_SUPPORTED) +
# sqrt(GIFTS_CASH) +
sqrt(GIFTS_PLEDGES) +
CASH_PFY1 +
CASH_PFY2 +
CASH_PFY3 +
CASH_PFY4 +
CASH_PFY5 +
CRU_GIVING_SEGMENT +
# EVALUATION_LOWER_BOUND +
# UOR_LOWER_BOUND +
# sqrt(MONTHS_ASSIGNED) +
# sqrt(COMMITTEE_NU_DISTINCT) +
# sqrt(COMMITTEE_NU_YEARS) +
# sqrt(COMMITTEE_KSM_DISTINCT) +
# sqrt(EVENTS_PREV_3_FY) +
sqrt(EVENTS_CFY) +
# sqrt(EVENTS_PFY1) +
# sqrt(ATHLETICS_TICKET_YEARS) +
# YEARS_SINCE_ATHLETICS_TICKETS +
ns(RECORD_YR, df = 5) +
YEARS_SINCE_MAX_CASH_YR +
GIVING_MAX_CASH_MO +
# KSM_PROSPECT +
# sqrt(VISITORS_5FY) +
LOYAL_5_PCT_CASH +
# UPGRADE3_CASH +
# VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
, alpha = 0 # Ridge penalty
)
Compare coefficients between the penalized and unpenalized models.
full_join(
data.frame(var = coef(glm_st_splines) %>% names(), unpenalized = coef(glm_st_splines))
, data.frame(var = coef(glm_ridge_cv)[, 1] %>% names(), shrinkage = coef(glm_ridge_cv)[, 1])
, by = 'var'
) %>% gather(model, 'coefficient', 2:3) %>%
na.omit() %>%
arrange(abs(coefficient) %>% desc()) %>%
ggplot(aes(x = var %>% reorder(-abs(coefficient)), y = coefficient, color = model)) +
geom_hline(yintercept = 0, color = 'darkgray') +
geom_point(alpha = .5) +
scale_y_continuous(trans = 'neg_sqrt', breaks = c(-50, -40, -30, seq(-20, 20, by = 5), -2, -.5, .5, 2)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .3)
, panel.grid.minor = element_line(linetype = 'dotted')) +
labs(x = 'var')
The ridge penalty leads to fairly aggressive coefficient shrinkage.
# Holdout data with new variables
holdout_new <- holdoutdat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
# Threshold
theta1 <- sum(traindat$rv.gave) / nrow(traindat)
# Calculations
tmp.ns <- conf_matrix(glm_standard, newdata = holdoutdat, threshold = theta1)
tmp.s <- conf_matrix(glm_st_splines, newdata = holdout_new, threshold = theta1)
tmp.rs <- conf_matrix_glmnet(glm_ridge_cv, newdata = holdout_new, rv = 'rv.gave', threshold = theta1)
# Data frame
model_compare <- cbind(
glm_baseline_err
, glm_nospline = c(tmp.ns$err, tmp.ns$prec, tmp.ns$sens, tmp.ns$F1)
, glm_spline = c(tmp.s$err, tmp.s$prec, tmp.s$sens, tmp.s$F1)
, glm_ridge = c(tmp.rs$err, tmp.rs$prec, tmp.rs$sens, tmp.rs$F1)
)
remove(tmp.ns, tmp.s, tmp.rs)
print(model_compare)
With threshold \(\theta =\) 0.092 (the observed donor rate in the training data), the glm_ridge model is the winner.
# Calculations
tmp.ns <- conf_matrix(glm_standard, newdata = holdoutdat)
tmp.s <- conf_matrix(glm_st_splines, newdata = holdout_new)
tmp.rs <- conf_matrix_glmnet(glm_ridge_cv, newdata = holdout_new, rv = 'rv.gave')
# Data frame
model_compare <- cbind(
glm_baseline_err
, glm_nospline = c(tmp.ns$err, tmp.ns$prec, tmp.ns$sens, tmp.ns$F1)
, glm_spline = c(tmp.s$err, tmp.s$prec, tmp.s$sens, tmp.s$F1)
, glm_ridge = c(tmp.rs$err, tmp.rs$prec, tmp.rs$sens, tmp.rs$F1)
)
remove(tmp.ns, tmp.s, tmp.rs)
print(model_compare)
But with a decision threshold of \(\theta =\) 0.5 the standard glm performs somewhat better, minimizing false negatives.
Consider the calibration plots.
smooth.method <- 'loess'
glm_preds <- data.frame(
class = (holdoutdat[, 1] + 0) %>% unlist()
, ridge.baseline = predict(glm_ridge_baseline_model, newdata = holdout_new, type = 'response')
, nospline = predict(glm_standard, newdata = holdout_new, type = 'response')
, spline = predict(glm_st_splines, newdata = holdout_new, type = 'response')
, ridge = predict(glm_ridge_cv, newdata = holdout_new, type = 'response')
) %>% setNames(
c('class', 'ridge.baseline', 'nospline', 'spline', 'ridge')
) %>% gather(
'model', 'prediction', ridge.baseline:ridge
)
# Plotting
glm_preds %>%
ggplot(aes(x = prediction, y = class, group = model, color = model)) +
geom_point(color = 'black', alpha = .1) +
geom_smooth(method = smooth.method, alpha = .5) +
geom_abline(slope = 1, intercept = 0) +
labs(title = paste0('Predictions with OOS smoother (', smooth.method, ')'), color = 'model'
, x = 'predicted probability'
, y = 'observed probability')
Interestingly, the out-of-the-box baseline ridge regression outperforms the ridge regression model with fewer explanatory variables. Of these four I’d take the spline glm for its interpretability.
We can also look at the ROC curves.
rocdat <- cbind(model = 'ridge.baseline', roc_matrix_gen(glm_ridge_baseline_model, data = holdout_new)) %>%
rbind(cbind(model = 'nospline', roc_matrix_gen(glm_standard, data = holdout_new))) %>%
rbind(cbind(model = 'spline', roc_matrix_gen(glm_st_splines, data = holdout_new))) %>%
rbind(cbind(model = 'ridge', roc_matrix_gen(glm_ridge_cv, data = holdout_new)))
# Plot results
rocdat %>%
ggplot(aes(x = FPR, y = TPR, color = model)) +
geom_line(size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'black') +
scale_x_continuous(breaks = seq(0, 1, by = .1), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(0, 1, by = .1), expand = c(0, 0)) +
coord_equal() +
labs(title = 'ROC plot')
Computing the AUC:
data.frame(
ridge.baseline = with(
rocdat %>% filter(model == 'ridge.baseline')
, sum(1/nrow(holdoutdat) * TPR)
)
, nospline = with(
rocdat %>% filter(model == 'nospline')
, sum(1/nrow(holdoutdat) * TPR)
)
, spline = with(
rocdat %>% filter(model == 'spline')
, sum(1/nrow(holdoutdat) * TPR)
)
, ridge = with(
rocdat %>% filter(model == 'ridge')
, sum(1/nrow(holdoutdat) * TPR)
)
)
These are almost indistinguishable, but again the spline glm appears to be a reasonable choice.
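As a cross-check, the AUC can also be computed with a trapezoidal rule over the same (FPR, TPR) points; a minimal sketch:
# Trapezoidal AUC over the ROC points, e.g.
# rocdat %>% filter(model == 'spline') %>% auc_trapezoid()
auc_trapezoid <- function(roc_points) {
  roc_points <- roc_points %>% arrange(FPR, TPR)
  sum(diff(roc_points$FPR) * (head(roc_points$TPR, -1) + tail(roc_points$TPR, -1)) / 2)
}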
I found previously that linear regression actually works pretty well for predicting cumulative giving amounts, particularly when conditioning on donor status (thereby excluding all 0 entries).
\[ E \left( \log \text{giving} \mid \text{donor, covariates} \right) = E \left( \log \left( Y_i \right) \mid Y_i > 0, X_i \right) = X_i \boldsymbol{\beta} \]
Here, \(Y_i = \text{FY18Giving}_i + \text{FY17Giving}_i\) and the training data \(X_i\) includes the observations where \(Y_i > 0\).
I’ll again pre-screen variables with the Boruta algorithm, but this time the response variable is continuous and only donors will be included.
# Sample rows
prop = 1 # Proportion of data to sample; 1 for all donors
set.seed(7968177)
# Include only entities who gave
samp <- sample_n(
modeling.data %>% filter(rv.gave)
, size = nrow(modeling.data %>% filter(rv.gave)) * prop
, replace = FALSE
) %>% select(
-rv.gave, -ID_NUMBER, -HOUSEHOLD_ID, -INSTITUTIONAL_SUFFIX, -DEGREES_CONCAT
)
# Run Boruta algorithm, but only on entities that gave
rf.vars.lm <- Boruta(
y = log10(samp$rv.amt)
, x = samp %>% select(-rv.amt)
, seed = 3371662
)
rf.vars.lm %>% print()
Boruta performed 99 iterations in 36.06102 mins.
83 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 78
more;
49 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 44 more;
15 tentative attributes left: COMMITTEE_KSM_LDR, COMMITTEE_NU_DISTINCT, DAF_GIFTS_AMT,
EVENTS_PFY1, GIFT_CLUB_BEQUEST_YRS and 10 more;
Save the results.
save(rf.vars.lm, file = 'data/rf.vars.lm.Rdata')
(lmod_plot <- rf.vars.lm %>% Borutadata() %>% Borutaplotter())
Some unsurprising findings:
And some surprising ones:
(recommended.vars.lm <- TentativeRoughFix(rf.vars.lm))
Boruta performed 99 iterations in 36.06102 mins.
Tentatives roughfixed over the last 99 iterations.
92 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 87
more;
55 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 50 more;
# Check variable correlations
recommended_vars_lm <- recommended.vars.lm$finalDecision[
which(recommended.vars.lm$finalDecision == 'Confirmed')] %>% names()
numeric_vars_lm <- modeling.data %>%
filter(rv.gave) %>%
select(recommended_vars_lm) %>%
select_if(is.numeric)
numeric_vars_lm %>% plot_corrs(textsize = 2)
The modeling data file is a slightly trimmed version of modeling.data.
# Data file with variables removed
lmdat <- modeling.data %>%
filter(rv.gave) %>%
select(rv.amt, recommended_vars_lm) %>%
select(
-GIVING_AF_TOTAL
, -KSM_GOS
, -contains('AF_PFY')
, -VELOCITY3_LIN_CASH
, -VELOCITY_BINS_NGC
, -VELOCITY_BINS_CASH
, -VELOCITY3_CASH
, -VELOCITY3_NGC
, -GIFTS_CASH
, -VISITORS_5FY
, -FIRST_KSM_YEAR
, -GIVING_PLEDGE_TOTAL
, -GIVING_MAX_CASH_FY
, -GIFTS_MATCHES
, -CRU_STATUS
, -BIRTH_DT
, -SPOUSE_SUFFIX
, -PAST_KSM_GOS_FLAG
, -GIVING_MAX_PLEDGE_MO
, -UPGRADE3_NGC
) %>% mutate(
# Create spouse flag
SPOUSE_ALUM = ifelse(SPOUSE_FIRST_KSM_YEAR > 0, 'TRUE', 'FALSE') %>% factor()
# Lump together little-used continents
, HOUSEHOLD_CONTINENT = fct_lump(HOUSEHOLD_CONTINENT, prop = .0075)
# Lump together little-used regions
, HOUSEHOLD_REGION = fct_collapse(
HOUSEHOLD_REGION
, 'NB' = 'INTL'
)
) %>% mutate_if(
# Numeric variables over 1E4 get a log10 transformation
function(x) {
ifelse(is.numeric(x), max(x) >= 1E4, FALSE)
}
, log10plus1
) %>% select(
-SPOUSE_FIRST_KSM_YEAR
)
# Cross-validation settings
folds = 10
reps = 5
# Withhold 10% of data as test set
xv <- KFoldXVal(lmdat, k = 2, prop = .1, seed = 1626452)
holdoutdat <- lmdat[xv[[1]], ]
traindat <- lmdat[xv[[2]], ]
remove(xv)
The model to beat will be a generic regression model with all variables previously identified as important.
# Store timings
timestamps <- list()
# Store linear models
lm_baseline <- list()
# Seed for reproducibility
set.seed(2285286)
# Outer loop (repetitions)
for (rep in 1:reps) {
# Status report
timestamp <- paste('+ Iteration', rep, 'beginning at:', Sys.time())
print(timestamp)
timestamps <- c(timestamps, timestamp)
# Create cross-validation indices
xv <- KFoldXVal(traindat, k = folds)
# Inner loop (parallel cross-validation)
models_out <- foreach(
fold = 1:length(xv)
, .combine = list
, .multicombine = TRUE
, .packages = c('dplyr', 'splines')
) %dopar% {
# Fit temp model
tmpmodel <- lm(
rv.amt ~ .
# Train while withholding some data
, data = traindat[-xv[[fold]], ]
)
preds <- data.frame(
prediction = predict(tmpmodel, newdata = traindat[xv[[fold]], ], type = 'response')
, actual = traindat$rv.amt[xv[[fold]]]
)
# Return results
return(
list(
rep = rep
, fold = fold
, model = tmpmodel
, predictions = preds
)
)
}
# Write results to errors data frame
lm_baseline <- c(lm_baseline, models_out)
# Status report
timestamp <- paste(' -Iteration', rep, 'ending at: ', Sys.time())
print(timestamp)
timestamps <- c(timestamps, timestamp)
}
[1] "+ Iteration 1 beginning at: 2018-12-21 15:57:05"
[1] " -Iteration 1 ending at: 2018-12-21 15:57:06"
[1] "+ Iteration 2 beginning at: 2018-12-21 15:57:06"
[1] " -Iteration 2 ending at: 2018-12-21 15:57:08"
[1] "+ Iteration 3 beginning at: 2018-12-21 15:57:08"
[1] " -Iteration 3 ending at: 2018-12-21 15:57:09"
[1] "+ Iteration 4 beginning at: 2018-12-21 15:57:09"
[1] " -Iteration 4 ending at: 2018-12-21 15:57:10"
[1] "+ Iteration 5 beginning at: 2018-12-21 15:57:10"
[1] " -Iteration 5 ending at: 2018-12-21 15:57:11"
lm_models <- ListExtract(lm_baseline, 'model')
lm_preds <- ListExtract(lm_baseline, 'predictions')
n <- folds * reps
# r^2 plots
plot_r2(lm_models) +
geom_text(y = seq(10, 150, length.out = n), label = 1:n, color = 'blue')
The \(r^2\) results are good for this application (anything over .5 or so is reasonable).
plot_mses(lm_models, lm_preds)
This is a pretty good result, and the out-of-sample MSE is only slightly worse than the in-sample MSE.
plot_resids(lm_models, lm_preds)$insample
plot_resids(lm_models, lm_preds)$outsample
On average the model slightly overestimates on the low end and underestimates on the high end.
plot_qq(lm_models, lm_preds)$insample
plot_qq(lm_models, lm_preds)$outsample
The Q-Q plots support the earlier observation that there’s extra density in the tails.
Consider the model coefficients.
p.sig <- .05
plot_coefs(lm_models) +
guides(color = FALSE) +
scale_y_continuous(trans = 'neg_sqrt')
coef_pm_table(lm_models, p.sig)
The table indicates the number of models in which the coefficient was significantly positive or negative (or not significantly different from 0) at the \(p =\) 0.05 level.
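For reference, a minimal sketch of the counting logic behind a table like this (the actual coef_pm_table() lives in code/functions.R): across the cross-validated fits, tally how often each coefficient is significantly positive, significantly negative, or not significant at the chosen level.
# Sketch of the sign-count logic; coef_pm_table() itself is defined in code/functions.R
coef_sign_counts <- function(models, p.sig = .05) {
  purrr::map_dfr(models, function(m) {
    cf <- summary(m)$coefficients
    data.frame(
      term = rownames(cf)
      , sign = case_when(
        cf[, 'Pr(>|t|)'] >= p.sig ~ 'not significant'
        , cf[, 'Estimate'] > 0 ~ 'positive'
        , TRUE ~ 'negative'
      )
    )
  }) %>% count(term, sign)
}
# Example: coef_sign_counts(lm_models, p.sig)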